BMC Medical Research Methodology — Latest Matching Preprints

1

Simulation-Based Comparison of Controlled Interrupted Time Series (CITS) and Multivariable Regression

ORWA, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.

2026-04-13 health policy 10.64898/2026.04.10.26350670 medRxiv

Top 0.1%

22.7%

When randomized controlled trials are impractical, interrupted time series designs offer a rigorous quasi-experimental approach to assessing population-level policies. Indeed, among quasi-experimental designs (QEDs), the interrupted time series (ITS) method is commonly regarded as the most robust. However, ITS designs are susceptible to serial correlation and to confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. We therefore provide a simulation-based comparison of controlled interrupted time series (CITS) and multivariable negative binomial regression for estimating policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches under a variety of data-generating scenarios that differed in series length, intervention effect size, and magnitude of lag-1 autocorrelation. Performance was assessed by bias, standard error calibration, confidence interval coverage, mean squared error, and statistical power. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although point-estimate performance was similar, inferential properties differed substantially. CITS consistently had smaller mean squared error, better agreement between model-based and empirical standard errors, and confidence interval coverage near the nominal 95% level under weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, even with Newey-West adjustments. These findings show the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population-level policies with time series data.
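
As a rough illustration of what a concurrent control series adds, the sketch below builds the regressors of the segmented-regression parametrisation commonly used for controlled interrupted time series. The abstract does not give the authors' model specification, so the helper `cits_design_rows` and all column names are hypothetical and shown only to make the design concrete.

```python
def cits_design_rows(n_periods, intervention_at):
    """Build CITS regressors for a control series (group=0) and an
    intervention series (group=1). All column names are illustrative.

    time             : period index 0..n_periods-1
    post             : 1 from the intervention period onward, else 0
    time_since       : periods elapsed since the intervention (0 before it)
    group            : 1 for the series exposed to the policy
    group_time       : group * time        (pre-existing trend difference)
    group_post       : group * post        (level change relative to control)
    group_time_since : group * time_since  (slope change relative to control)
    """
    rows = []
    for group in (0, 1):
        for t in range(n_periods):
            post = 1 if t >= intervention_at else 0
            time_since = t - intervention_at if post else 0
            rows.append({
                "time": t,
                "post": post,
                "time_since": time_since,
                "group": group,
                "group_time": group * t,
                "group_post": group * post,
                "group_time_since": group * time_since,
            })
    return rows


if __name__ == "__main__":
    for row in cits_design_rows(n_periods=6, intervention_at=3):
        print(row)
```

In this parametrisation the coefficients on `group_post` and `group_time_since` carry the policy effect, because shocks shared by both series are absorbed by `post` and `time_since`. Serial correlation still has to be handled when the model is fitted; the sketch covers only the design matrix.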

2

Integrating stakeholder perspectives in modeling routine data for therapeutic decision-making

Pfaffenlehner, M.; Dressing, A.; Knoerzer, D.; Wagner, M.; Heuschmann, P.; Scherag, A.; Binder, H.; Binder, N.

2026-02-18 epidemiology 10.64898/2026.02.18.26346074 medRxiv

Top 0.1%

18.9%

Background: Routinely collected health data are increasingly used to generate real-world evidence for therapeutic decision-making. Yet stakeholders, including clinicians, pharmaceutical industry representatives, patient advocacy groups, and statisticians, prioritize different aspects of data quality, analysis, and interpretation. Without explicit consideration of these perspectives, analyses risk being fragmented, misaligned with end-user needs, or lacking in transparency. Methods: We developed a stakeholder-inclusive conceptual framework for modeling routine health data, informed by an interdisciplinary workshop and supported by targeted literature examples. The framework maps stakeholder priorities to methodological requirements and identifies analytical strategies that enable integration of diverse perspectives. Results: Clinicians prioritize interpretability and clinical relevance; the pharmaceutical industry emphasizes regulatory compliance and real-world evidence generation; patient groups highlight transparency, inclusion of patient-reported outcomes, and privacy protection; and statisticians focus on bias control and methodological rigor. Our framework illustrates how these priorities can be explicitly incorporated into modeling strategies. Multistate models exemplify a methodological approach that operationalizes these requirements by capturing dynamic disease trajectories, integrating intermediate outcomes, and offering graphical interpretability. Beyond specific methodological choices, clinical research relies fundamentally on statistical expertise. Depending on the research goal, statisticians' roles can range from providing statistical consultations for standard analyses, to applying or adapting advanced methods for more complex analyses, to developing new methods for research questions whose specific characteristics require novel approaches. Conclusions: The stakeholder-inclusive framework provides methodological guidance for designing analyses of routine health data that are clinically meaningful, scientifically rigorous, and socially acceptable. By aligning the research question with the intended perspective from the beginning, it supports more robust and transparent evidence generation, with multistate models serving as a flexible tool to operationalize this integration.

3

Causal analyses using education-health linked data for England: a case study

De Stavola, B. L. L.; Aparicio Castro, A.; Nguyen, V. G.; Lewis, K. M.; Dearden, L.; Harron, K.; Zylbersztejn, A.; Shumway, J.; Gilbert, R.

2026-03-19 health policy 10.64898/2026.03.13.26348340 medRxiv

Top 0.1%

18.2%

Introduction: This article summarises lessons learnt from the Health Outcomes for young People throughout Education (HOPE) Study and serves as a real-world, transferable application for addressing causal questions using administrative data. The HOPE study applied causal methods to analyses of administrative data in Education and Child Health Insights from Linked Data (ECHILD), aimed at studying the effectiveness of provision for special educational needs and disability (SEND) on health and education outcomes. Methods: Defining causal questions regarding the impact of SEND provision required judicious mapping of the question onto the data, leading to the selection of appropriate measures of effect, transparent handling of the data, and control of confounding factors to estimate effects. We adopted the target trial emulation framework to guide these steps. Having encountered specific computational challenges in estimating the effects of interest, we simulated data that resembled the HOPE study and used them to practise the implementation of alternative estimation methods and to study the impact of some of their assumptions. Results: The creation and analysis of the simulated data provided valuable insights. First, we learned the importance of aligning the target of estimation with the causal question at hand. Second, we observed how deviations from assumptions specific to each estimation method can affect results. Third, we highlighted the benefits of employing alternative estimation methods as sensitivity tools that can aid the interpretation of the resulting estimates. Finally, we offer user-friendly code in two programming languages (R and Stata) and accompanying simulated data to facilitate the implementation of these methods for similar causal questions. Conclusion: We recommend that users of administrative data fully specify (and possibly revise) the causal questions they wish to address and carefully examine and compare the assumptions, implementation, and results obtained using alternative estimation methods.

4

Comparative performance of the concurrent comparator design with existing vaccine safety surveillance approaches on real-world observational health data

Chattopadhyay, S.; Bu, F.; Schuemie, M. J.; McLeggon, J.-A.; Westlund, E.; Hripcsak, G.; Ryan, P. B.; Suchard, M. A.

2026-01-26 public and global health 10.64898/2026.01.25.26344812 medRxiv

Top 0.1%

17.9%

Background: Identifying safety signals originating from wide-scale immunization efforts is a critical public health concern. Such safety signals may be identified from spontaneous reports and other data sources. Although some work has been done on the best methods for vaccine safety surveillance, there is a scarcity of information on how these perform in analyses of real-world data. Methods: We use four administrative claims databases and one electronic health record (EHR) database to evaluate the operating characteristics of the recently proposed concurrent comparator, self-controlled case series, historical comparator, and case-control epidemiological designs for vaccine safety, using negative control outcomes (unrelated to the vaccine), imputed positive control outcomes, and one real-world positive control outcome (myocarditis or pericarditis) for COVID-19. In this evaluation, we consider vaccine exposures for COVID-19, 2017-2018 seasonal influenza, H1N1pdm flu, Human Papillomavirus (HPV), and Varicella-Zoster. The methods are compared based on type 1 error, power of association detection, and proportion of non-finite association estimates produced. Results: All methods exhibit systematic error, leading to type 1 errors that are greater than the nominal (α = 0.05) threshold, often by a substantial amount. To restore near-nominal type 1 error, we carry out empirical calibration based on the large set of negative controls. After empirical calibration, the self-controlled case series designs had the highest power overall, closely followed by the concurrent comparator designs. However, concurrent comparator analyses often produced a higher proportion of non-finite estimates. Conclusion: Our results indicate that there remains non-negligible systematic error under the concurrent comparator. In terms of statistical performance, the concurrent comparator designs show promising results in some scenarios, regularly outperforming the historical comparator and case-control designs, but often producing non-finite estimates. Future work building on the concurrent comparator design is required to construct more efficient designs with lower systematic error.

5

Generation of Synthetic Data in Health Surveys Using Large Language Models

Villarreal-Zegarra, D.; Bellido-Boza, L.

2026-01-30 health informatics 10.64898/2026.01.27.26345015 medRxiv

Top 0.1%

17.9%

Background: Generating synthetic data using artificial intelligence, such as large language models (LLMs), is a useful strategy in public health because it can reduce time and costs, expand access to data, and facilitate information sharing without compromising confidentiality. Objective: To evaluate the consistency and psychometric plausibility of synthetic data generated by an LLM to simulate the responses of survey participants (user personas) in a national health survey in Peru. Methods: We conducted a cross-sectional study based on the National Health Satisfaction Survey (ENSUSALUD 2016) of ambulatory health service users. We used the GPT-OSS-20B model to generate synthetic responses in Spanish, conditioned on narrative profiles derived from sociodemographic and clinical variables. We evaluated consistency between responses and profile characteristics (sex, age, and comorbidities) using performance metrics (accuracy, precision, recall, F1 score, and AUC). We compared distributions between real and synthetic data using t-tests and chi-square tests. For latent variables, we conducted confirmatory factor analyses of the PHQ-9, PHQ-8, and GAD-7 (WLSMV; polychoric matrices) and estimated internal consistency (α and ω). We examined normality (Jarque-Bera test) and stability through correlations between real measures (PHQ-2 and EQ-5D) and synthetic measures (PHQ-2, PHQ-8, PHQ-9, GAD-2, and GAD-7). Results: The model showed strong concordance with the profile for sex, age, and chronic disease status, with metrics close to 1 for most variables; overall consistency was high in the vast majority of cases. The synthetic PHQ-9, PHQ-8, and GAD-7 instruments showed optimal factor fit and high internal consistency. Synthetic measures were positively and significantly correlated with the real PHQ-2 and negatively correlated with EQ-5D, with moderate to high correlations, particularly for PHQ-8/PHQ-9 and GAD-7. Conclusions: An LLM can generate plausible synthetic data for health surveys when its output is conditioned on user personas, preserving high coherence with demographic and clinical characteristics and maintaining adequate psychometric properties in depression and anxiety scales. However, relevant deviations were identified (e.g., overestimation of obesity, unexpected distributions in some variables, and missing values in a sensitive item), which supports the need for rigorous validation and bias control before using these data for inferential purposes or public policy.

6

The Independence of Discrimination and Calibration in Clinical Risk Prediction: Lessons from a Multi-Timeframe Diabetes Prediction Framework

OReilly, E.; Kurakovas, T.

2026-02-14 health informatics 10.64898/2026.02.12.26346147 medRxiv

Top 0.1%

16.8%

Background: Clinical risk prediction models are typically evaluated by discrimination (area under the receiver operating characteristic curve, AUC), with calibration receiving less attention. We developed a multi-timeframe diabetes prediction framework emphasizing calibration and used synthetic data validation to investigate whether good discrimination guarantees good calibration. Methods: We generated 500,000 synthetic patients using published epidemiological parameters from QDiabetes-2018, FINDRISC, and the Diabetes Prevention Program. The framework comprises a discrete-time survival ensemble with isotonic calibration, producing predictions at 6, 12, 24, and 36 months with bootstrap confidence intervals. We evaluated discrimination (AUC), bin-level calibration (expected calibration error, ECE), calibration-in-the-large (observed-to-expected ratio), and clinical utility (decision curve analysis). We compared performance against QDiabetes-2018 implemented on the same synthetic cohort. Results: Despite achieving excellent discrimination (AUC = 0.844, 95% CI: 0.840-0.848) and low bin-level calibration error (ECE = 0.006), the framework systematically overpredicted risk by 50%: mean predicted probability was 8.4% versus an observed rate of 5.6% (observed-to-expected ratio = 0.66, 95% CI: 0.65-0.67). This miscalibration occurred despite isotonic regression on a held-out calibration set. Overprediction was present in 9 of 10 risk deciles. Risk stratification remained valid (23.5-fold separation, 95% CI: 22.8-24.3, between highest and lowest tiers), confirming that discrimination was preserved. QDiabetes-2018 achieved comparable discrimination (AUC = 0.831) with better calibration (O:E = 0.89). Decision curve analysis showed net benefit across the threshold range 5-30%, though recalibration would improve clinical utility. Conclusions: Good discrimination does not guarantee good calibration. Our primary finding is negative: isotonic calibration failed to produce well-calibrated predictions even on synthetic data from a single generator. This has important implications for model deployment, where distribution shift is inevitable. We recommend that prediction model studies report calibration-in-the-large alongside bin-level metrics, as ECE alone can be misleading when risk distributions are skewed. Recalibration on deployment populations will likely be necessary for any prediction model, regardless of development-phase calibration performance.

Key Messages

What is already known
- Clinical prediction models require both discrimination (ranking patients correctly) and calibration (accurate probability estimates)
- Isotonic regression is a recommended approach for post-hoc calibration
- Expected calibration error (ECE) is commonly reported as a summary calibration metric

What this study adds
- Demonstrates empirically that excellent discrimination (AUC = 0.844) can coexist with substantial miscalibration (50% overprediction)
- Shows that low ECE can be misleading when most patients fall in low-risk deciles
- Provides evidence that isotonic calibration on held-out data may not generalize even within synthetic data from one generator
- Demonstrates a discrete-time survival architecture that reduces monotonicity violations to <0.1%

How this study might affect research, practice, or policy
- Prediction model studies should report calibration-in-the-large (O:E ratio) alongside ECE
- Developers should expect recalibration to be necessary when deploying to new populations
- Claims of calibrated prediction should be viewed skeptically without comprehensive calibration assessment
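
The two calibration summaries contrasted in this abstract can be sketched in a few lines. The function names below are illustrative, and the binning scheme (equal-width bins on the predicted probability) is an assumption, since the abstract does not say how ECE was binned. As a check on the reported figures, 5.6 / 8.4 ≈ 0.67, which is in line with the reported O:E of 0.66 (presumably computed from unrounded values), and 8.4 / 5.6 = 1.5 corresponds to the stated 50% overprediction.

```python
def observed_to_expected(y_true, p_pred):
    """Calibration-in-the-large: observed events divided by expected events."""
    return sum(y_true) / sum(p_pred)


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Bin-level calibration error with equal-width probability bins.

    Each bin contributes |observed rate - mean prediction|, weighted by
    the share of patients that fall in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))

    n_total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed_rate = sum(y for y, _ in members) / len(members)
        mean_prediction = sum(p for _, p in members) / len(members)
        ece += (len(members) / n_total) * abs(observed_rate - mean_prediction)
    return ece


if __name__ == "__main__":
    # Toy data only: four patients, all predicted at 50% risk, one event.
    outcomes = [0, 0, 0, 1]
    predictions = [0.5, 0.5, 0.5, 0.5]
    print(observed_to_expected(outcomes, predictions))
    print(expected_calibration_error(outcomes, predictions))
```

Because ECE weights each bin by its share of patients, a cohort concentrated in low-risk bins can show a small ECE while the overall O:E ratio is far from 1, which is the pattern the abstract reports.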

7

Transportability of missing data models across study sites for research synthesis

Thiesmeier, R.; Madley-Dowd, P.; Ahlqvist, V.; Orsini, N.

2026-03-10 epidemiology 10.64898/2026.03.09.26347913 medRxiv

Top 0.1%

14.9%

Introduction: Systematically missing covariates are a common challenge in medical research synthesis of quantitative data, particularly when individual participant data cannot be shared across study sites. Imputing covariate values in studies where they are systematically unobserved, using information from sites where the covariate is observed, implicitly assumes similarity of associations across studies. The behaviour of this assumption, and the bias arising from violating it, remains difficult to reason about qualitatively. Here, we evaluated a two-stage imputation approach for handling systematically missing covariates using simulations across a range of statistical and causal heterogeneity scenarios. Methods: We conducted a simulation study with varying degrees of between-study heterogeneity and systematic differences in model parameters. A binary confounder was set to systematically missing in half of the studies. Study-specific effect estimates were combined using a two-stage meta-analytic model. The performance of the imputation approach was evaluated with the primary estimand being the pooled conditional confounding-adjusted exposure effect across all studies. Results: Bias in the pooled adjusted effect estimate was small across scenarios with low to substantial between-study heterogeneity. Bias increased monotonically with increasingly pronounced differences in causal structures across study sites. Coverage remained close to the nominal level under low to substantial between-study heterogeneity, but deteriorated markedly as differences in causal structures between study sites became more severe. Conclusion: The two-stage cross-site imputation approach produced valid pooled effect estimates across a wide range of simulated scenarios but showed monotonic sensitivity to differences in causal structures across studies. The results provide insight into the conditions under which cross-site imputation may be appropriate for handling systematically missing covariates in research synthesis.
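
The second stage of a two-stage meta-analysis, in which study-specific estimates are combined, can be sketched as inverse-variance pooling. The abstract does not say whether the authors used a common-effect or a random-effects model, so the common-effect version below, including the name `pool_common_effect`, is an assumption for illustration only.

```python
import math


def pool_common_effect(estimates, std_errors):
    """Inverse-variance weighted pooling of study-specific effect estimates.

    Returns the pooled estimate and its standard error. A random-effects
    variant would add an estimate of the between-study variance to each
    study's variance before computing the weights.
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    # Hypothetical log-odds ratios from three sites, not data from the study.
    pooled, pooled_se = pool_common_effect([0.40, 0.55, 0.30], [0.10, 0.20, 0.15])
    print(pooled, pooled_se)
```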

8

From Study Design to Executable Code: Automating Target Trial Emulation with Large Language Models

Kim, H.; Kim, M.; Kim, S.; You, S. C.

2026-03-14 health informatics 10.64898/2026.03.13.26348306 medRxiv

Top 0.1%

14.7%

Introduction: Implementing target trial emulation (TTE) study methods as end-to-end executable analytic code is technically demanding, and producing standardized, reproducible scripts consistently across research teams remains a persistent challenge. We aimed to develop a framework that translates free-text study descriptions into standardized analytic specifications and executable Strategus R scripts for the Observational Health Data Sciences and Informatics (OHDSI) ecosystem. Methods: We developed THESEUS (Text-guided Health-study Estimation and Specification Engine Using Strategus), which operates through two sequential steps. Large language models (LLMs) first map descriptions of the study into a constrained JavaScript Object Notation (JSON) schema (standardization step), after which the structured specifications are converted into R scripts with a self-auditing loop for error correction (code generation step). We evaluated eight proprietary LLMs using texts extracted from the methods sections of 15 OHDSI-based TTE studies, and externally validated the framework on texts from 5 non-OHDSI studies, across three input settings: primary analysis text only, full analyses text, and full methods sections. Standardization was evaluated at the study level (whether all parameters in a study were correctly extracted) and at the field level (sensitivity and false positive rate per individual parameter), with field-level evaluation applied to the full analyses text and full methods sections input settings. Code generation was assessed by executability of the produced R scripts before and after self-auditing. Results: In the standardization step, study-level accuracy across models ranged from 0.91 to 0.98 for primary analysis, 0.67 to 0.87 for full analyses, and 0.67 to 0.85 for full methods sections in OHDSI studies, whereas the corresponding ranges were 0.73 to 0.93, 0.60 to 0.87, and 0.27 to 0.47 in non-OHDSI studies. At the field level, sensitivity across models under the full analyses text input setting ranged from 0.73 to 0.90 with 0.27 to 0.67 false positives per study in OHDSI studies, and from 0.71 to 0.90 with 0.20 to 1.00 false positives per study in non-OHDSI studies, depending on input setting. For code generation, first-run executability ranged from 0.80 to 1.00 for OHDSI studies and improved to 0.93 to 1.00 after self-auditing. In non-OHDSI studies, first-run executability ranged from 0.60 to 1.00, improving to 1.00 after self-auditing. Discussion: THESEUS demonstrates that pairing a standardized data model with a structured analysis framework enables reliable LLM-powered automation of the coding step in observational research. THESEUS supports the reliable translation of natural-language study descriptions into executable, shareable code in standardized observational research settings. This approach has the potential to lower the technical barriers to participation in observational research for a broader range of investigators.

9

Standardisation of terminology, calculation and reporting for assigning exposure duration to drug utilisation records from healthcare data sources: the CreateDoT framework

Riera-Arnau, J.; Paoletti, O.; Gini, R.; Thurin, N. H.; Souverein, P. C.; Abtahi, S.; Duran, C. E.; Pajouheshnia, R.; Roberto, G.

2026-02-19 epidemiology 10.64898/2026.02.18.26346576 medRxiv

Top 0.1%

12.6%

Background: In pharmacoepidemiological studies, the days of treatment (DoT) duration associated with individual electronic drug utilization records (DUR) is usually missing. Researcher-defined duration (RDD) calculation approaches, as opposed to data-driven approaches, can be used to estimate DoT based on the specific choices and assumptions made by investigators. These are usually underreported or even undocumented. We aimed to develop a framework for the standardization of terminology, formulas, implementation, and reporting of possible RDD approaches. Methods: A systematic classification of RDD calculation approaches was developed via expert consensus. Universal concepts used to operationalise RDDs were identified and described using standard terminologies. An open-source R function, CreateDoT, was created to implement the formulas, taking the universal concepts as input parameters. A step-by-step workflow was developed to facilitate implementation and reporting. Results: RDD approaches were classified into two main classes: I) daily dose (DD)-based calculation approaches (n=3 formulas), and II) fixed-duration approaches (n=2). Seven universal concepts were identified to describe the five corresponding generalized formulas for DoT calculation. Input parameters of the CreateDoT function can be retrieved from source data through its mapping to universal concepts, or entered by the investigator based on the chosen calculation approach. The input file structure itself represents a standard reporting template for documenting investigators' assumptions and the methodological choices adopted for DoT calculation. Conclusions: The CreateDoT framework can facilitate the documentation and reporting of RDD approaches for DoT calculation, increasing the transparency and reproducibility of pharmacoepidemiological studies regardless of the data model used, and can facilitate sensitivity analyses to evaluate the impact of alternative assumptions in DoT calculation.
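
To make the two classes of researcher-defined duration concrete, here is a minimal sketch of one daily-dose-based calculation and one fixed-duration calculation. The abstract does not list the five CreateDoT formulas or the function's parameters, so the function names, argument names, and the specific formula below are assumptions and do not reproduce the CreateDoT interface.

```python
def dot_from_daily_dose(units_dispensed, strength_per_unit, daily_dose):
    """Daily-dose-based duration: total amount dispensed / assumed daily dose.

    units_dispensed   : number of tablets (or other units) in the record
    strength_per_unit : amount of active substance per unit, e.g. mg
    daily_dose        : investigator-assumed daily dose in the same unit
    """
    if daily_dose <= 0:
        raise ValueError("daily_dose must be positive")
    return units_dispensed * strength_per_unit / daily_dose


def dot_fixed(duration_days):
    """Fixed-duration approach: every record gets the same assumed duration."""
    if duration_days <= 0:
        raise ValueError("duration_days must be positive")
    return float(duration_days)


if __name__ == "__main__":
    # Hypothetical record: 30 tablets of 10 mg with an assumed 20 mg/day dose.
    print(dot_from_daily_dose(30, 10.0, 20.0))
    print(dot_fixed(28))
```

Writing each assumption as an explicit argument mirrors the reporting goal described in the abstract: the inputs themselves document the investigator's choices, and rerunning with different values is a simple sensitivity analysis.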

10

Outcome Risk Modeling for Disability-Free Longevity: Comparison of Random Forest and Random Survival Forest Methods

Vanghelof, J. C.; Tzimas, G.; Du, L.; Tchoua, R.; Shah, R. C.

2026-02-17 health informatics 10.64898/2026.02.13.26346264 medRxiv

Top 0.1%

12.4%

Background: When creating risk prediction models for time-to-event data, methods that incorporate time are typically used. Random survival forests (RSF), an extension of random forests (RF), are one such class of models. We compared RSF to RF in the context of time-to-event outcomes in the ASPirin in Reducing Events in the Elderly (ASPREE) randomized controlled trial. We hypothesized that RSF would have superior discrimination and calibration versus RF. Methods: Participants from ASPREE residing outside the US or with missing data were excluded. A total of 2,291 participants were assigned 1:1 into training and test sets. RF and RSF models were trained using a total of 115 measures as candidate predictors. The outcome of interest was the earliest of incident dementia, physical disability, or death. Results: The primary endpoint occurred in 10.5% of participants. Discrimination was similar between the models: sensitivity (~0.75), specificity (~0.57), positive predictive value (~0.17), time-dependent AUC (~0.71), and Harrell's concordance (~0.73). Calibration was likewise similar: Brier score (~0.09). Discussion: The RF and RSF models exhibited comparable discrimination and calibration. We conclude that RSF may not always lead to more accurate predictions of outcomes compared to RF. Further examination in different clinical trial cohorts is needed to better understand the contexts in which adding time to outcome risk modeling adds value.
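
Harrell's concordance, one of the discrimination measures reported above, can be sketched directly from its definition. This is a simplified version that assumes right-censored data and skips pairs with tied event times; it is not the implementation used in the study, and the function name is illustrative.

```python
def harrell_concordance(times, events, risk_scores):
    """Fraction of comparable pairs in which the subject with the earlier
    observed event also has the higher predicted risk.

    A pair (i, j) is comparable when subject i had an event (events[i] == 1)
    strictly before subject j's event or censoring time. Tied risk scores
    count as half-concordant. Pairs with tied times are ignored here.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the earliest event has the highest risk score, and so on.
    print(harrell_concordance([1, 2, 3], [1, 1, 0], [0.9, 0.5, 0.1]))
```

The statistic needs only a single risk score per participant, so it can be computed for a classifier such as RF as well as for a survival model such as RSF, which is what allows the two to be compared on the same measure.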

11

Bias and Variance of Adjusting for Instruments

Hripcsak, G.; Anand, T.; Chen, H. Y.; Zhang, L.; Chen, Y.; Suchard, M. A.; Ryan, P. B.; Schuemie, M. J.

2026-03-15 epidemiology 10.64898/2026.03.13.26348328 medRxiv

Top 0.1%

12.2%

Propensity score adjustment is commonly used in observational research to address confounding. Controversy persists about how to select covariates as possible confounders to generate the propensity model. A desire to include all possible confounders is offset by a concern that more covariates will augment bias or increase variance. Much of the concern is over instruments, which are variables that affect the treatment but not the outcome. Adjusting for an instrument has been shown to increase bias due to unadjusted confounding and to increase the variance of the effect estimate. Large-scale propensity score (LSPS) adjustment includes most available pre-treatment covariates in its propensity model. It addresses instruments with a pair of diagnostics: ceasing the analysis if any covariate exceeds a correlation coefficient of 0.5 with the treatment, and checking for an aggregation of instruments with equipoise reported as a preference score. Our simulation assesses the impact of adjusting for instruments in the context of LSPS's diagnostics. In our simulation, even when the variance of the treatment contributed by the adjusted instrument(s) exceeded that of an unadjusted confounder by over twenty-fold, when the correlation between the instrument(s) and the treatment was less than 0.5 and the equipoise was greater than 0.5, the additional shift in the effect estimate due to adjusting for the instrument(s) was less than the shift due to confounding by itself. Therefore, we find in this simulation that adjusting for instruments contributed a minor amount of bias to the effect estimate. This simulation aligns well with a previous assessment of the impact of adjusting for instruments and with separate empirical evidence that adjusting for many covariates surpasses attempts to identify a limited set of confounders.
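
The equipoise diagnostic mentioned above can be sketched from the usual definition of the preference score, which rescales the propensity score by the treatment prevalence on the logit scale. The abstract does not state the formula or the bounds it used, so the definition below and the 0.3 to 0.7 interval, which is a common convention, should be read as assumptions.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def preference_score(propensity, prevalence):
    """Preference score F with logit(F) = logit(propensity) - logit(prevalence)."""
    x = logit(propensity) - logit(prevalence)
    return 1.0 / (1.0 + math.exp(-x))


def equipoise(propensities, prevalence, lower=0.3, upper=0.7):
    """Share of patients whose preference score lies within [lower, upper]."""
    scores = [preference_score(p, prevalence) for p in propensities]
    in_range = sum(1 for f in scores if lower <= f <= upper)
    return in_range / len(scores)


if __name__ == "__main__":
    # Hypothetical propensity scores with a treatment prevalence of 20%.
    print(equipoise([0.2, 0.2, 0.9, 0.01], prevalence=0.2))
```

A strong instrument, or an aggregation of weaker ones, pushes propensity scores toward 0 or 1, which moves preference scores out of the central interval and lowers equipoise. That is why the diagnostic can flag instrument-driven models.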

12

Comparing optimal transport and machine learning approaches for databases merging in scenarios involving missing data in covariates. Application to Medical Research

N'kam suguem, F.; DEJEAN, s.; Saint-Pierre, P.; Savy, N.

2026-01-26 bioinformatics 10.64898/2026.01.23.701369 medRxiv

Top 0.1%

10.0%

Motivation: One of the challenges encountered when merging heterogeneous observational clinical datasets is the recoding of categorical target variables that may have been measured differently across data sources. Standard machine learning-based approaches, such as Multiple Imputation by Chained Equations and the k-Nearest Neighbours method, are compared with an Optimal Transport-based algorithm (OTrecod) when databases are altered by missing values in covariates or by imbalanced groups. The empirical performance in these realistic data integration settings remains underexplored. Results: A comprehensive simulation study was conducted, varying sample size, group imbalance, signal-to-noise ratio, and mechanisms of missing data. The results demonstrate that OTrecod consistently achieves higher recoding accuracy compared with Multiple Imputation by Chained Equations and k-Nearest Neighbours, particularly in large, imbalanced, and weak-signal scenarios. These findings are further illustrated using subsets of the National Child Development Study, where OTrecod and Multiple Imputation by Chained Equations minimised the distributional divergence between recoded social-class scales, while k-Nearest Neighbours produced less stable results. Availability and Implementation: The source code supporting this study is publicly available at https://github.com/FloAI/CompareOT.

13

An AI Agent for Automated Causal Inference in Epidemiology

Liu, H.; Shi, K.; Li, A.; Li, X.; Chu, J.; Xue, Y.; Cen, S.; Wang, Y.; Zhang, T.

2026-02-06 epidemiology 10.64898/2026.02.06.26345723 medRxiv

Top 0.1%

9.9%

Objective: To address the inefficiency, subjectivity, and high expertise barrier of traditional epidemiological causal inference, this study designed, developed, and validated an AI-powered agent (EpiCausalX Agent) to automate the end-to-end workflow. It integrates cross-database literature retrieval, intelligent causal reasoning, and Directed Acyclic Graph (DAG) visualization to provide a reliable, accessible tool for researchers. Materials and Methods: Built on the LangChain 1.0 framework with a layered design (Agent/Tool/Storage/Utility Layers), the agent uses the DeepSeek V3.2 LLM and the ReAct paradigm for dynamic task orchestration. Four specialized tools were integrated: multi-database retrieval across 7 databases, causal inference based on Hill's criteria and DAG logic, automated DAG drawing using NetworkX and Matplotlib, and clinical standard query. Performance was validated via unit tests, workflow verification, and usability testing. Results: The agent achieved full-process automation. It efficiently retrieves and synthesizes literature, automatically identifies confounders and mediators, and generates standardized interactive DAGs. It produces evidence-based, traceable conclusions aligned with established epidemiological knowledge. Its user-friendly natural language interface enables seamless use by non-technical researchers, who complete task initiation quickly without operational confusion. The agent is publicly available as a WeChat Mini Program for easy access. Conclusion: EpiCausalX Agent advances intelligent, automated epidemiological research. By integrating domain expertise with AI agent technology, it overcomes limitations of manual methods and general LLMs to provide a specialized, verifiable, efficient solution. It has broad applications in observational research, clinical study design, and education to enhance productivity and lower barriers to rigorous causal analysis.

14

PRE-CISE: A PRE-calibration Coverage, Identifiability, and SEnsitivity analysis workflow to streamline model calibration

Gracia, V.; Goldhaber-Fiebert, J. D.; Alarid-Escudero, F.

2026-03-02 health policy 10.64898/2026.02.27.26346591 medRxiv

Top 0.1%

8.5%

Purpose: We introduce PRE-CISE, a pre-calibration workflow that integrates coverage analysis, local sensitivity, and collinearity diagnostics to streamline model calibration and transparently address nonidentifiability. We demonstrate the benefits of PRE-CISE using a four-state Sick-Sicker Markov testbed and a COVID-19 case study. Methods: PRE-CISE begins with a coverage analysis to verify that model outputs generated with parameter sets drawn from their prior distribution span the calibration targets, followed by local sensitivity analyses that quantify the influence of parameters on model outputs and guide the resizing of the prior distribution bounds to improve coverage. Identifiability is then assessed via collinearity analysis; large indices indicate practical nonidentifiability. For the testbed model, we calibrated 3 parameters to survival, prevalence, and the proportion of Sick to Sicker at 10, 20, and 30 years. For the COVID-19 model, we calibrated 11 parameters to match daily confirmed incident cases. Bayesian calibration was conducted in both analyses. Results: Coverage analyses flagged initial misfits; local sensitivities identified that the Sick-to-Sicker transition probability had a greater effect on model outputs, and resizing its prior distribution bounds improved coverage. Collinearity analyses showed that combining multiple calibration targets across time points enabled recovery of all three parameters. In the COVID-19 model, local sensitivity analyses prioritized time-varying detection rates and contact-reduction effects, reducing the search space and thereby improving calibration efficiency. Daily incident case calibration targets yielded collinearity indices below practical thresholds (e.g., < 15) for all parameter combinations, whereas weekly calibration targets yielded larger indices that were closer to the cutoff. Conclusions: PRE-CISE provides a practical, transparent pathway that helps modelers refine prior distribution bounds and calibration targets before intensive calibration, improving uncertainty reporting and strengthening the reliability of model-based health policy analyses.
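
The coverage step described in this abstract can be illustrated with a minimal Python sketch. It is not the PRE-CISE implementation: the function name `coverage_check`, the toy survival model, the prior bounds, and the target intervals are all hypothetical, and "covered" is assumed here to mean that the range of prior-predictive outputs contains the whole target interval.

```python
import random


def coverage_check(simulate, sample_prior, targets, n_draws=500, seed=0):
    """For each target interval, report whether prior-predictive outputs span it.

    simulate(params) -> dict of model outputs keyed by target name.
    sample_prior(rng) -> one parameter set drawn from the prior.
    targets -> dict mapping target name to a (lower, upper) interval.
    """
    rng = random.Random(seed)
    outputs = [simulate(sample_prior(rng)) for _ in range(n_draws)]
    report = {}
    for name, (lower, upper) in targets.items():
        values = [out[name] for out in outputs]
        sim_min, sim_max = min(values), max(values)
        report[name] = {
            "sim_range": (sim_min, sim_max),
            # Assumption: coverage means the simulated range contains the interval.
            "covered": sim_min <= lower and upper <= sim_max,
        }
    return report


# Hypothetical toy model: constant annual mortality probability p.
def toy_model(params):
    p = params["p_die"]
    return {"survival_10y": (1 - p) ** 10, "survival_20y": (1 - p) ** 20}


def toy_prior(rng):
    return {"p_die": rng.uniform(0.01, 0.05)}


# Made-up target intervals, chosen so that one is covered and one is not.
toy_targets = {"survival_10y": (0.70, 0.75), "survival_20y": (0.20, 0.25)}
report = coverage_check(toy_model, toy_prior, toy_targets)
```

A target flagged as not covered, such as `survival_20y` in this toy setup, would suggest widening or shifting the prior bounds before calibration is attempted.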

15

Comparison of methods for assessing effects of risk factors on disease progression in Mendelian randomization under index event bias

Zhang, L.; Higgins, I. A.; Dai, Q.; Gkatzionis, A.; Quistrebert, J.; Bashir, N.; Dharmalingam, G.; Bhatnagar, P.; Gill, D.; Liu, Y.; Burgess, S.

2026-03-02 epidemiology 10.64898/2026.02.26.26347193 medRxiv

Top 0.1%

6.6%

Mendelian randomization has emerged as a transformative approach for inferring causal relationships between risk factors and disease outcomes. However, applying Mendelian randomization to disease progression - a critical step in validating pharmacological targets - is hampered by index event bias. This form of selection bias occurs because analyses of disease progression are necessarily restricted to individuals who have already experienced the disease event. Here, we present a comprehensive evaluation of statistical methods designed to mitigate index event bias, including inverse-probability weighting, Slope-Hunter, and multivariable methods. We compare the performance of these methods in simulations and applied examples. Inverse-probability weighting methods reduce bias, but require individual-level data and will only fully eliminate bias when the disease event model is correctly specified. Slope-Hunter performed poorly in all simulation scenarios, even when its assumptions were fully satisfied. Multivariable methods worked best when including genetic variants that affect the incident disease event. However, if these genetic variants also affect disease progression directly, then the analysis will suffer from pleiotropy. Hence, if the same biological mechanisms affect disease incidence and progression, then multivariable methods will have little utility. But in such a case, analyses of disease progression are less critical, as conclusions reached from analyses of disease incidence are likely to hold for disease progression. Our findings indicate that no single method is a universal solution to provide reliable results for the investigation of disease progression. Instead, we propose a strategic framework for method selection based on data availability and biological context.

16

Methodological Guidance for Predictor Variable Selection for Adolescent Smoking Outcomes in Global Youth Tobacco Survey Using R and Python

Ng'ambi, W. F.; Zyambo, C.; Kazembe, L.

2026-02-17 epidemiology 10.64898/2026.02.14.26346305 medRxiv

Top 0.1%

6.4%

Background: The Global Youth Tobacco Survey (GYTS) is widely used to monitor tobacco use among adolescents worldwide. However, inconsistent analytical approaches, particularly in handling complex survey designs and predictor selection, limit comparability across countries, survey waves, and software platforms. Although much of the GYTS literature relies on proprietary tools such as SAS and SPSS, practical and transparent guidance on implementing reproducible, theory-informed analyses remains limited. A unified workflow that respects the survey's design while supporting cross-platform implementation is needed. Methods: We developed a reproducible, open-source workflow for analysing GYTS data using R and Python. In R, analyses were conducted using the survey package (svydesign and svyglm) with constrained stepwise selection via stepAIC. In Python, a custom constrained stepwise procedure was implemented using statsmodels generalized linear models. The workflow explicitly incorporates survey weights, stratification, and clustering; harmonises variables across countries; protects a priori demographic covariates; and ensures consistent treatment of categorical predictors. The approach is illustrated using data from Zambia (n = 2,959) and pooled data from Ghana, Mauritius, Seychelles, and Togo (n = 15,914). Predictor selection was guided by Social Cognitive Theory and evidence from systematic reviews. Results: The constrained selection framework consistently retained key demographic variables (age, sex, and grade) while allowing data-driven selection of modifiable predictors using the Akaike Information Criterion. When identical constraints were applied, the R and Python implementations selected identical models and produced nearly equivalent point estimates (adjusted odds ratio differences <0.01), although Python-based confidence intervals did not account for clustering. Of 18 candidate predictors across individual, social, media, and policy domains, 14 were retained. The strongest independent predictors included awareness of tobacco products (OR = 5.61, 95% CI: 4.65-6.78), peer smoking (OR = 4.57, 95% CI: 3.34-6.25), and exposure to tobacco marketing (OR = 2.34, 95% CI: 1.89-2.91). Conclusions: This study provides a generalisable, theory-informed framework for predictor selection in complex survey data using open-source tools. The workflow supports consistent analyses across countries, survey waves, and software platforms, and is transferable to other youth and adult population surveys. All code and harmonisation resources are openly available to support reproducibility and adaptation.
Plain-Language Summary:
- What we asked: Can we predict adolescent smoking using GYTS data in a way that is easy to follow and reproducible across software?
- What we did: Built a single workflow that respects survey design (weights, strata, clusters) and selects predictors using four explicit criteria: theoretical grounding in Social Cognitive Theory, empirical support from prior studies, relevance for intervention, and cross-country validity. Core demographics (age, sex, grade, region) were protected as essential confounders, while other predictors were selected based on statistical fit. The workflow runs equivalently in R and Python.
- Why it matters: Many GYTS studies use weights only and ignore clustering and stratification, which makes confidence intervals too narrow. More importantly, most analyses include variables arbitrarily or let software drop important confounders automatically. Our approach ensures theoretically meaningful, policy-relevant variables are retained, producing more reliable and actionable results for prevention programs.
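
The idea of protecting a priori covariates during AIC-based selection can be sketched in a few lines of Python. This is a forward-only simplification under stated assumptions, not the authors' workflow: the function `constrained_forward_selection` is hypothetical, and the `aic` callable stands in for fitting a survey-weighted model and returning its AIC. The toy scoring function and its numbers are invented for illustration and are not results from the study.

```python
def constrained_forward_selection(candidates, protected, aic):
    """Greedy forward selection by AIC that never drops protected covariates.

    candidates -> iterable of optional predictor names.
    protected  -> predictor names that are always kept in the model.
    aic        -> callable taking a list of names and returning a model AIC
                  (assumed to wrap an actual model fit in real use).
    """
    selected = list(protected)
    best_aic = aic(selected)
    remaining = [c for c in candidates if c not in protected]
    while remaining:
        # Score each one-variable extension of the current model.
        scores = {c: aic(selected + [c]) for c in remaining}
        best_candidate = min(scores, key=scores.get)
        if scores[best_candidate] >= best_aic:
            break  # no extension lowers the AIC, so stop
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_aic = scores[best_candidate]
    return selected, best_aic


# Hypothetical stand-in for a model fit: invented deviance reductions per
# predictor plus the usual AIC penalty of 2 per parameter.
TOY_GAIN = {"age": 0.0, "sex": 0.0, "grade": 0.0,
            "peer_smoking": 10.0, "marketing": 5.0, "noise": 0.5}


def toy_aic(features):
    return 100.0 - sum(TOY_GAIN[f] for f in features) + 2 * len(features)


model, model_aic = constrained_forward_selection(
    candidates=["peer_smoking", "marketing", "noise"],
    protected=["age", "sex", "grade"],
    aic=toy_aic,
)
```

In this toy run the protected demographics stay in the model even though they contribute nothing to fit, while `noise` is excluded because its small gain does not offset the penalty.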

17

Protocol for LLM-Generated CONSORT Report for Increased Reporting: A Parallel-Arm Randomized Controlled Trial (Protocol)

Krauska, A. N.; Rohe, K.

2026-04-17 health policy 10.64898/2026.04.15.26350926 medRxiv

Top 0.1%

6.2%

Background: Randomized controlled trials (RCTs) often have incomplete methods reporting despite widespread adoption of the CONSORT guideline. The editorial process is supposed to detect these shortcomings and request clarifications from authors, which is time-consuming. We developed an LLM-based CONSORT Rohe Nordberg Report that highlights which CONSORT items appear fully or partially reported, checks the page references claimed by authors, and then creates follow-up questions that help authors supply missing information. Methods: This parallel-arm, superiority RCT will randomize eligible RCT submissions (after desk screening) 1:1 into intervention (editorial team and authors receive the Rohe Nordberg Report) or control (standard editorial review only). The primary outcome is whether manuscripts improve their reporting of CONSORT items in the Methods and Results sections between the original submission and first revision. This will be assessed by blinded human reviewers who evaluate the textual changes between the original and revised manuscripts for improvements in each relevant CONSORT item. Secondary outcomes include time to editorial decisions, rejection and non-resubmission rates, whether authors can correctly identify where CONSORT items are reported, and extent of revisions. Human evaluators will be blinded to whether the manuscript was in the intervention or control group. Discussion: By providing authors and the editorial team with specific follow-up questions for each underreported CONSORT item, we hypothesize that basic underreporting will be more efficiently detected and corrected. Using blinded human reviewers as the primary outcome assessors ensures a rigorous, unbiased evaluation. If successful, this approach may help align manuscripts more closely with CONSORT standards, ultimately benefiting evidence synthesis.

18

TrialScout links published results to trial registrations using a large language model

Ahnström, L.; Bruckner, T.; Aspromonti, D. A.; Caquelin, L.; Cummins, J.; DeVito, N. J.; Axfors, C.; Ioannidis, J. P. A.; Nilsonne, G.

2026-03-17 epidemiology 10.64898/2026.03.15.26348383 medRxiv

Top 0.1%

6.2%

Background: Multiple stakeholders need to locate results of registered clinical trials but frequently struggle to find them. Summary results of clinical trials are often not published in trial registries, and publications containing trial results are often not explicitly linked to their respective trial registrations. Finding these results is important to researchers, systematic reviewers, research funders, regulators, clinical practitioners, and patients. Methods: We developed TrialScout, a computer program that uses a large language model to match clinical trials registered on ClinicalTrials.gov with corresponding result publications indexed in PubMed. TrialScout's performance was evaluated through comparison to human-coded matches from previous studies of results reporting rates. Subsequently, TrialScout was applied to a random sample of 9,600 completed or terminated trials. Results: TrialScout had a sensitivity of 92.5% and a specificity of 81.2% compared to human coders. Manual review of 200 cases where TrialScout disagreed with human researchers showed that a majority (123/200, 61.5%, 95% CI, 54.4-68.3%) of disagreements were due to human errors. When used on 9,600 sampled trials in ClinicalTrials.gov, TrialScout found result publications for 6,110 (63.6%) of trials. Discussion: TrialScout reliably located results of completed clinical trials. The tool offers benefits in terms of speed and efficiency. Estimation of TrialScout's accuracy is limited by the lack of a true gold standard. TrialScout can accelerate the process of locating trial results in the scientific literature and can assist in monitoring trial reporting practices.
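
As an illustration of how an interval for a proportion such as 123/200 can be computed, the following Python sketch uses the Wilson score interval. The abstract does not state which interval method the authors used, so Wilson is an assumption here and the result need not match the reported 54.4-68.3% exactly; the function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Disagreements attributed to human error in the manual review: 123 of 200.
low, high = wilson_interval(123, 200)
```

Exact (Clopper-Pearson) intervals are another common choice and are typically somewhat wider than Wilson intervals for the same data.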

19

Interpretable Lifestyle-Based Machine Learning Models for Ten-Year Cardiovascular Risk Prediction using data from the UK Biobank

Feng, Y.; Kunz, H.; Dziopa, K.

2026-02-01 health informatics 10.64898/2026.01.26.26344438 medRxiv

Top 0.1%

6.0%

Background: Cardiovascular diseases (CVDs) remain the leading global cause of morbidity and mortality. In clinical practice, 10-year risk prediction tools such as the Pooled Cohort Equations, QRISK3, and SCORE2 are widely used because of their transparency and clinical trustworthiness, but they rely heavily on biomarkers and medical history. Hence, most recommendations concentrate on pharmaceutical or procedural management, and in many situations, crucial biomarker indicators are unavailable, making it difficult to precisely evaluate individual risk and select appropriate treatments. Objective: To develop interpretable, lifestyle-based machine learning models for predicting 10-year risk of cardiovascular disease (including heart failure and atrial fibrillation), and more critically, to systematically compare interpretability algorithms and assess the cross-model consistency of the identified behavioural factors. Methods: Using UK Biobank data, logistic regression, random forest, and XGBoost models were trained on lifestyle (including sleep, smoking, diet, physical activity and electronic device use) and demographic variables only. Discrimination, calibration and interpretability (using permutation importance, SHapley Additive Explanations and Local Interpretable Model-agnostic Explanations) were evaluated, with subgroup analyses by sex and age to characterise heterogeneity in model behaviour and feature relevance. Results: The developed models demonstrated good discrimination, with XGBoost performing best (ROC-AUC 0.726 [95% CI 0.720-0.731]; PR-AUC 0.199), closely followed by logistic regression (ROC-AUC 0.721 [95% CI 0.716-0.726]; PR-AUC 0.192), while random forest showed slightly lower performance. Despite this similar performance, interpretability analyses revealed inconsistencies in the models' importance rankings of lifestyle factors. Age, sex, and smoking behaviours consistently emerged as key contributors across all interpretability methods, demonstrating strong cross-model agreement, while other lifestyle factors such as dietary patterns, physical activity, and sleep showed model-dependent variation in their assigned importance. Subgroup analyses further indicated that modifiable behaviours (smoking, diet, sleep) were particularly influential among younger females, whereas cumulative exposures and family history were more dominant drivers in older males. Conclusions: Lifestyle-only interpretable models offer a scalable and low-cost framework for cardiovascular risk assessment and behaviour-focused prevention, without requiring laboratory measurements or clinical testing. By comparing multiple interpretability algorithms across models, this study shows strong cross-method consistency and highlights lifestyle factors whose importance profiles differ from those in traditional biomarker-based calculators. These models can complement existing risk tools by highlighting modifiable behaviours, which is particularly valuable for younger adults. They can also support personalised feedback in digital-health settings to promote behavioural change. Overall, the findings support the development of transparent, behaviour-focused tools that enable accessible and equitable cardiovascular prevention.

20

Estimating Chronic Kidney Disease Stage Transitions from Irregular Electronic Health Record Data Using an Expectation-Maximization Framework

Qi, W.; Lobo, J. M.; Yan, G.; Ghenbot, R.; Sands, K. G.; Krupski, T. L.; Culp, S. H.; Otero-Leon, D.

2026-03-09 urology 10.64898/2026.03.08.26347890 medRxiv

Top 0.1%

5.0%

Objective: To estimate chronic kidney disease (CKD) stage transition probabilities in patients with small renal masses (SRMs) using irregularly observed electronic health record (EHR) data, addressing challenges of interval censoring and irregular measurement intervals in real-world clinical practice. Data Sources: We used EHR data from the University of Virginia Small Renal Mass (SRM) registry (2006 to January 2026), capturing outpatient renal function data prior to any definitive treatment. CKD stages were defined using estimated glomerular filtration rate (eGFR) thresholds based on KDIGO guidelines. Study Design: The final analytic cohort included 527 patients with at least two outpatient eGFR measurements prior to definitive treatment. We applied an expectation-maximization (EM) algorithm to estimate discrete-time CKD stage transition matrices while accounting for irregular follow-up and unobserved intermediate transitions. Transition matrices were estimated under 3- and 6-month cycle lengths overall as well as stratified by age and sex. Likelihood ratio tests were used to compare EM-based estimates with a naive one-step counting estimator. Results: The EM framework yielded clinically plausible transition structures dominated by self-transitions and progression primarily to adjacent CKD stages, with reduced spurious backward transitions relative to the naive estimator. Transition patterns were consistent across 3- and 6-month cycle lengths. Age-stratified analyses showed that older patients had slightly higher probabilities of progression to more advanced CKD stages compared with younger patients, whereas sex-stratified differences were minimal. Likelihood ratio comparisons supported the consistency of the EM-based models with the observed transition data in both the overall cohort and subgroup analyses. Conclusions: The EM approach provides a principled and computationally efficient method for estimating CKD stage progression from irregularly observed EHR data, yielding transition matrices suitable for discrete-time decision-analytic and health economic models.
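
To show why irregular follow-up complicates estimation, the Python sketch below evaluates the observed-data log-likelihood of a discrete-time Markov chain when consecutive measurements are separated by several cycles. It is not the authors' EM algorithm. It relies only on the standard fact that a k-cycle transition has probability given by the (i, j) entry of the k-th power of the one-cycle matrix, and it assumes a time-homogeneous chain with gaps that are whole numbers of cycles. The three-stage matrix and the observations are hypothetical.

```python
from math import log


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, k):
    """Return the k-th power of a square matrix (k >= 0)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, p)
    return result


def log_likelihood(p, observations):
    """Log-likelihood of (from_stage, to_stage, gap_in_cycles) observations.

    Intermediate stages within a gap are unobserved, so each observation
    contributes the k-step probability (P^k)[from][to].
    """
    total = 0.0
    for start, end, gap in observations:
        total += log(mat_pow(p, gap)[start][end])
    return total


# Hypothetical one-cycle transition matrix over three ordered stages.
P = [[0.90, 0.10, 0.00],
     [0.05, 0.85, 0.10],
     [0.00, 0.05, 0.95]]

# Hypothetical observations with gaps of 1, 2, and 3 cycles.
obs = [(0, 0, 1), (0, 1, 2), (1, 2, 3)]
ll = log_likelihood(P, obs)
P3 = mat_pow(P, 3)
```

A naive one-step counting estimator treats every observed pair as a single-cycle transition, whereas a likelihood of this form lets the gap length enter explicitly.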